(54) WEB PAGE DECODING SYSTEM






(57) Abstract:

PROBLEM TO BE SOLVED: To provide a Web page decoding system capable of
accurately extracting the text part of an HTML document that contains a
tag describing a uniform resource locator (URL).

SOLUTION: Basic source data constituting a Web page are retrieved from
the storage area designated by a prescribed URL and written to a storage
means. When a tag containing a URL description is detected in the basic
source data, the URL in that tag is detected. Source data are then
retrieved from the storage area designated by the detected URL and
written to the storage means, and the text part is extracted from all
the source data stored in the storage means.
* NOTICES *

JPO and INPIT are not responsible for any
damages caused by the use of this translation.

1. This document has been translated by computer. So the translation may not reflect the
original precisely.

2. **** shows the word which can not be translated.
3. In the drawings, any words are not translated.



CLAIMS 

[Claim(s)] 

[Claim 1] A Web page decipherment system which decodes a text part of an HTML document
which constitutes a Web page, comprising:

a means which retrieves basic source data constituting said Web page from a storage area
specified by a predetermined URL (uniform resource locator) and writes it to a storage
means;

a tag detection means which detects the existence of a predetermined tag including a URL
description part in said basic source data;

a URL detection means which detects the URL in the predetermined tag when the existence of
said predetermined tag is detected;

a means which retrieves source data from a storage area specified by the URL detected by
said URL detection means and writes it to said storage means; and

a text extraction means which extracts a text part from all the source data stored in said
storage means.

[Claim 2] The Web page decipherment system according to claim 1, wherein said
predetermined tag is a frame tag.

[Claim 3] The Web page decipherment system according to claim 1 or 2, wherein said URL
detection means detects the URL in the syntax <FRAME SRC="URL"> as the URL in said frame tag.

[Claim 4] The Web page decipherment system according to claim 1, wherein said text
extraction means extracts, as the text part, the portions of the source data stored in said
storage means other than the portions enclosed by <>.

[Claim 5] The Web page decipherment system according to claim 1, further comprising a voice
output means which creates and outputs an audio signal corresponding to the text part
extracted by said text extraction means.

[Claim 6] The Web page decipherment system according to claim 1, wherein an output sound
signal of said voice output means is supplied to a telephone via a dial-up line.

[Translation done.] 
DETAILED DESCRIPTION

[Detailed Description of the Invention] 
[0001] 

[Technical Field of the Invention] This invention relates to a Web page decipherment
system which decodes the text part of a Web page.

[0002] 

[Description of the Prior Art] The WWW (World Wide Web), one of the information services
of the Internet, allows multimedia information such as characters, images, and sound to be
viewed via the Internet, using HTML files written in a language called HTML (HyperText
Markup Language) and URLs (Uniform Resource Locators), which identify where those files
are stored. A Web page is what is formed on a display screen when an HTML file is
processed by viewing software called a WWW browser. On the information-providing side, a
WWW server stores HTML files in association with URLs and operates according to a server
application. On the information-receiving side, i.e., the client (client computer), a WWW
browser is used to access, via the Internet, the source data at a desired URL, which
include an HTML file (and, for example, graphics files and audio files), and the Web page
formed from the files in the source data can be viewed on a display screen.
[0003] An HTML document usually consists of a text part, made up of text, and a display
information part, made up of tags. A tag is a symbol delimited by a pair of angle brackets
<>, and HTML syntax is built from tags, as in <HTML> ... </HTML>. The portion enclosed by
the tag brackets <> holds various display information, such as the size, font, and color
of the characters of the text part displayed on the Web page, the background color of the
Web page, image file names, and image positions.
[0004] Thus, since the portion enclosed by <> in an HTML document is not displayed itself
but is control information for displaying the document, removing the portions enclosed by
<> usually leaves a plain text document. Meanwhile, a system which reads aloud the
document of a Web page displayed on a WWW browser may be built on the Internet. Such a
system reads out the document of a Web page and sends it to a telephone as an audio signal
when, for example, the client terminal is not a computer but a telephone connected to the
public line. Web page read-aloud is performed by extracting from the HTML file the text
data portion, excluding the portions enclosed by <>, synthesizing the voice data
corresponding to the character codes of that text data portion, and outputting the result
as a continuous audio signal.
[0005] 

[Problem(s) to be Solved by the Invention] HTML provides, for example, the syntax
<FRAMESET> ... </FRAMESET> for forming frames on a Web page, and with it a Web page with a
divided screen is obtained. From the HTML file in which this syntax is written, a separate
HTML file is called for each split screen and its document is displayed. That is, a URL
(including an HTML file name) is specified for each split screen with a tag such as
<FRAME SRC="URL">, and the contents of the HTML file located at the specified URL are
displayed.

[0006] Conventionally, however, there was a problem that the text part could not be
correctly extracted from an HTML document containing a tag that describes a URL, such as
a frame tag. The purpose of this invention is therefore to provide a Web page decipherment
system which can correctly extract the text part from an HTML document containing a tag
that describes a URL.
[0007] 

[Means for Solving the Problem] The Web page decipherment system of this invention decodes
the text part of an HTML document which constitutes a Web page, and is provided with the
following.

A means which retrieves basic source data constituting the Web page from a storage area
specified by a predetermined URL (uniform resource locator) and writes it to a storage
means.

A tag detection means which detects the existence of a predetermined tag including a URL
description part in the basic source data.

A URL detection means which detects the URL in the predetermined tag when the existence of
the predetermined tag is detected.

A means which retrieves source data from a storage area specified by the URL detected by
the URL detection means and writes it to the storage means.

A text extraction means which extracts a text part from all the source data stored in the
storage means.

With this composition, when the basic source data are an HTML file containing a frame tag,
the text part of the HTML file stored at the URL described in that frame tag can also be
extracted.
[0008]

[Embodiment of the Invention] Hereafter, an example of this invention is described in
detail with reference to the drawings. Drawing 1 shows the composition of the Web page
decipherment system according to this invention. In this system, WWW server 1 is a server
which provides the WWW as an information service; it stores HTML files in association with
URLs and also stores graphics files and audio files. WWW server 1 is connected to the
Internet line network 2.
[0009] The CTI (Computer Telephony Integration) server 3 is connected to the Internet line
network 2. CTI server 3 is also connected to the dial-up line network 4. Although two or
more telephones are actually connected to the dial-up line network 4, only one telephone 5
is shown here. A telephone may be an ordinary telephone, a public telephone, or a portable
telephone. Although exchange facilities such as relay stations and base stations exist in
the dial-up line network 4, they are not shown in the figure.

[0010] CTI server 3 is a server which provides Web page read-aloud to telephone users,
including the user of telephone 5. CTI server 3 is equipped with the Web page acquisition
part 31, the text extraction part 32, and the text read-aloud part 33. The Web page
acquisition part 31 accesses WWW server 1 and acquires the source data of a Web page. The
text extraction part 32 analyzes the source data acquired by the Web page acquisition part
31 and extracts the text part. The text read-aloud part 33 creates a synthesized speech
signal according to the character codes of the text and outputs the synthesized speech
signal to the telephone over the dial-up line network 4. The Web page acquisition part 31,
the text extraction part 32, and the text read-aloud part 33 are realized by the operation,
described later, of the processor (not shown) of CTI server 3.

[0011] CTI server 3 contains the memory storage 35, such as a hard disk, and various data
such as source data are stored in the memory storage 35, as described later. For
communication over the Internet line network 2, WWW server 1 and CTI server 3 each use
TCP/IP as the communications protocol, and an IP address is assigned to each of WWW server
1 and CTI server 3. HTTP is used as the WWW protocol. Although not illustrated, WWW server
1 and CTI server 3 are connected to the Internet line network 2 via routers.

[0012] Next, the operation of this Web page decipherment system is explained. When a user
calls CTI server 3 from the telephone 5 and a call is established between the telephone 5
and CTI server 3, CTI server 3 first, as shown in drawing 2, accesses a predefined URL in
order to acquire the source data of the Web page at that URL (Step S1). Supposing this URL
is on WWW server 1, WWW server 1 reads the source data, consisting of files such as an
HTML file, located at the URL and transmits them to CTI server 3. These source data are
the basic source data which constitute the Web page. The transmitted source data are
supplied to CTI server 3 via the Internet line network 2.

[0013] CTI server 3 stores the source data sent from WWW server 1 in the memory storage 35
(Step S2) and extracts FRAME (frame) tags from the stored source data (Step S3). The FRAME
tag extraction operation is described later; the URLs contained in the FRAME tags are
written to the memory storage 35 as a frame list.

[0014] CTI server 3 determines whether any FRAME tag was extracted as a result of Step S3
(Step S4). When no FRAME tag was extracted, it proceeds to Step S7, described below. When
a FRAME tag was extracted, it accesses each URL in the frame list in order to acquire the
source data of the Web page at that URL (Step S5), and stores the source data sent from
WWW server 1 in the memory storage 35 (Step S6). The operation of WWW server 1 in response
to the access of Step S5 is the same as for the access of Step S1. After Step S6, it
proceeds to Step S7.
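
The acquisition flow of Steps S1 to S6 can be sketched roughly as follows. This is only an
illustrative sketch, not the implementation of the CTI server: the names
`acquire_source_data`, `fetch`, and `FRAME_SRC` are hypothetical, the list `storage` merely
stands in for the memory storage 35, and resolving relative frame URLs with `urljoin` is an
assumption that the text above does not state.

```python
import re
from typing import Callable, List
from urllib.parse import urljoin

# Hypothetical pattern for the <FRAME SRC="URL"> syntax (double-quoted URLs only).
FRAME_SRC = re.compile(r'<FRAME\s+[^>]*?SRC\s*=\s*"([^"]*)"', re.IGNORECASE)


def acquire_source_data(base_url: str, fetch: Callable[[str], str]) -> List[str]:
    """Return the basic source data followed by the source of each framed page."""
    storage: List[str] = []               # stands in for the memory storage 35
    base = fetch(base_url)                # Step S1: access the predefined URL
    storage.append(base)                  # Step S2: store the basic source data
    frame_list = FRAME_SRC.findall(base)  # Step S3: extract FRAME tags into a frame list
    if frame_list:                        # Step S4: was any FRAME tag extracted?
        for url in frame_list:
            # Steps S5 and S6: access each listed URL and store what is returned.
            # Resolving the URL against base_url is an assumption of this sketch.
            storage.append(fetch(urljoin(base_url, url)))
    return storage                        # input for the text extraction of Step S7
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP
client; a dictionary lookup is enough to exercise it.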
[0015] In Step S7, CTI server 3 extracts the text parts other than the portions enclosed
by tag brackets <> from the source data stored in the memory storage 35, and writes the
extracted text data to the memory storage 35 as read-aloud data (Step S8). A synthesized
speech signal is then created based on the read-aloud data (Step S9), and the synthesized
speech signal is output to the telephone 5 (Step S10). Since the read-aloud data written
to the memory storage 35 are text data consisting of two or more character codes, the
voice data corresponding to each character code and to each word-unit group of character
codes are looked up in the memory storage 35, and a continuous synthesized speech signal
is created by combining these voice data. The synthesized speech signal is supplied to the
telephone 5 via the dial-up line network 4 and is output as sound from the receiver of the
telephone 5. A data table showing the relation between character codes and voice data is
stored beforehand in the memory storage 35.
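
The table lookup of Step S9 might look something like the following. This is a sketch
under stated assumptions: the function name `synthesize` and the `voice_table` mapping are
hypothetical, voice data are represented as plain byte strings, and preferring the longest
matching word unit over single characters is a choice made for this example, not something
the text specifies.

```python
from typing import Dict, List


def synthesize(text: str, voice_table: Dict[str, bytes]) -> bytes:
    """Concatenate the voice data of the longest table entry found at each position."""
    # Longest key in the table bounds how far ahead a word unit can extend.
    max_len = max((len(key) for key in voice_table), default=1)
    pieces: List[bytes] = []
    i = 0
    while i < len(text):
        # Try word-sized units first, then fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            unit = text[i:i + size]
            if unit in voice_table:
                pieces.append(voice_table[unit])
                i += size
                break
        else:
            i += 1  # no voice data for this character: skip it (an assumption)
    return b"".join(pieces)  # the continuous signal built from the pieces
```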

[0016] Next, the FRAME tag extraction operation on the source data in the above-mentioned
Step S3 is explained with reference to the flow chart of drawing 3. CTI server 3 searches
the source data stored in the memory storage 35 for the character string <FRAME SRC (Step
S11) and determines whether the character string <FRAME SRC exists in the source data
(Step S12). That is, the source data written to the memory storage 35 contain an HTML
file, and it is determined whether frame settings are made in the document which the HTML
file represents. If the character string <FRAME SRC exists, the syntax <FRAME SRC="URL">
is present, so the read position is moved to the following = character (Step S13), the
character string enclosed by the subsequent "", i.e., the URL, is read from the source
data, and the URL is written to the frame list formed in the memory storage 35 (Step S14).
The URL indicating the location of the HTML document contained in a frame is thus written
to the frame list. After Step S14, it is determined whether the search for the character
string <FRAME SRC has been completed over all the files of the source data (Step S15);
when the search is not completed, the operation returns to Step S11 and the above steps
are repeated.
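
The scan of Steps S11 to S15 can be illustrated with a short sketch. The function name
`extract_frame_urls` and the constant `FRAME_MARKER` are hypothetical, and two details are
assumptions of this example rather than part of the description: the search ignores letter
case, and a tag with a missing quote simply ends the scan.

```python
import re
from typing import List

# The character string searched for in Step S11 (case-insensitive by assumption).
FRAME_MARKER = re.compile(r"<FRAME\s+SRC", re.IGNORECASE)


def extract_frame_urls(source: str) -> List[str]:
    """Collect the double-quoted URL that follows each <FRAME SRC in source."""
    frame_list: List[str] = []
    pos = 0
    while True:
        match = FRAME_MARKER.search(source, pos)  # Step S11: search for the string
        if match is None:                         # Steps S12 and S15: none (left)
            break
        eq = source.find("=", match.end())        # Step S13: move to the = character
        start = source.find('"', eq + 1) if eq != -1 else -1
        end = source.find('"', start + 1) if start != -1 else -1
        if end == -1:                             # malformed tag: stop scanning
            break
        frame_list.append(source[start + 1:end])  # Step S14: write URL to frame list
        pos = end + 1                             # continue after this URL
    return frame_list
```

Note that <FRAMESET is not mistaken for a frame tag, because the pattern requires
whitespace between FRAME and SRC.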
[0017] Next, the text data extraction operation in the above-mentioned Step S7 is
explained with reference to the flow chart of drawing 4. CTI server 3 acquires the
character code of one character at a time, starting from the head of one file in the
source data stored in the memory storage 35 (Step S21), and determines whether the
acquired character code represents the character < (Step S22). When the acquired character
code represents the character <, the tag flag F_TAG is set to 1 (Step S23). When it does
not, it is determined whether the tag flag F_TAG is equal to 1 (Step S24). The tag flag
F_TAG is a flag which is set to 1 within a portion enclosed by <> in the HTML file and to
0 elsewhere; its initial value is 0. The determination of Step S24 is also performed after
Step S23.

[0018] When the determination of Step S24 finds that the tag flag F_TAG is equal to 1, it
is determined whether the character code acquired at Step S21 represents the character >
(Step S25). When the acquired character code represents the character >, it is the end of
a tag, so the tag flag F_TAG is set to 0 (Step S26). On the other hand, when the
determination of Step S24 finds that the tag flag F_TAG is equal to 0, the character
belongs to a text part outside the tag portions enclosed by <>, so the acquired character
code is stored in the memory storage 35 so as to be included in the read-aloud data (Step
S27).

[0019] After Step S26 or Step S27, it is determined whether the character-by-character
acquisition of character codes has been completed for all the source data (Step S28);
when it has not, the operation returns to Step S21 and the above steps are repeated. With
this system, therefore, when the basic source data are an HTML file containing a FRAME
tag, the text part of the HTML file stored at the URL described in the FRAME tag can also
be extracted, so the text part of the Web page displayed on a WWW browser can be read
aloud without omission.
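
The flag-driven loop of Steps S21 to S28 is small enough to sketch directly. The function
name `extract_text` and the variable `f_tag` (standing for the tag flag) are hypothetical,
and resetting the flag at the start of each file is an assumption; the description only
states that the initial value is 0.

```python
from typing import Iterable, List


def extract_text(source_files: Iterable[str]) -> str:
    """Return the characters of every file that lie outside <...> portions."""
    read_aloud: List[str] = []
    for source in source_files:
        f_tag = 0                      # tag flag; reset per file (an assumption)
        for ch in source:              # Step S21: take one character at a time
            if ch == "<":              # Step S22: does it represent '<'?
                f_tag = 1              # Step S23: a tag portion begins
            if f_tag == 1:             # Step S24: inside a tag portion?
                if ch == ">":          # Step S25: does it represent '>'?
                    f_tag = 0          # Step S26: the tag portion ends
            else:
                read_aloud.append(ch)  # Step S27: keep the text character
        # Step S28: the loops end once all source data have been read
    return "".join(read_aloud)
```

For example, `extract_text(["<P>Hello</P>", "<B>World</B>"])` yields `"HelloWorld"`,
covering both the basic file and a framed file in one pass.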

[0020] Although the above example describes the case where a frame tag is used, this
invention is also applicable to tags other than the frame tag which describe a URL. This
invention can also be applied when extracting and reading aloud the text part of an HTML
file containing a script language such as JavaScript. To extend its functions, HTML also
provides a tag which allows a script language such as JavaScript to run on a Web page,
formed with syntax such as <SCRIPT LANGUAGE="JavaScript"> ... </SCRIPT>. In that case, in
the same manner as described above, the portion in the range
<SCRIPT LANGUAGE="JavaScript"> ... </SCRIPT> is disregarded and the remaining text part is
extracted.
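
One way to disregard the script range is to delete it before removing the remaining tags,
as in the sketch below. The names `SCRIPT_BLOCK`, `TAG`, and
`extract_text_ignoring_scripts` are hypothetical, and the use of regular expressions is a
substitute chosen for brevity; it is not the character-by-character procedure of drawing 4.

```python
import re

# Everything from an opening <SCRIPT ...> tag through the matching </SCRIPT>.
SCRIPT_BLOCK = re.compile(r"<SCRIPT\b[^>]*>.*?</SCRIPT\s*>", re.IGNORECASE | re.DOTALL)
# Any remaining portion enclosed by <>.
TAG = re.compile(r"<[^>]*>")


def extract_text_ignoring_scripts(source: str) -> str:
    """Drop script ranges first, then drop the remaining tags."""
    without_scripts = SCRIPT_BLOCK.sub("", source)
    return TAG.sub("", without_scripts)
```

Removing the script range first matters because script code may itself contain < or >
characters, which would otherwise be misread as tag boundaries.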

[0021] The system of this invention is applicable not only to files written in HTML but
also to files written in languages that extend HTML. In Japan, every page that can be
viewed in a WWW browser has come to be called a "homepage"; however, since a homepage is
originally the base page of the two or more Web pages which constitute one information
group, the term "Web page" is used here to avoid misunderstanding.
[0022] 

[Effect of the Invention] As described above, the Web page decipherment system of this
invention can correctly extract the text part from an HTML document containing a tag that
describes a URL. Therefore, when a Web page is read aloud, it can be read without omitting
any text part of the Web page.






DESCRIPTION OF DRAWINGS 

[Brief Description of the Drawings] 

[Drawing 1] It is a block diagram showing the composition of the Web page decipherment
system according to this invention.

[Drawing 2] It is a flow chart which shows the operation of the CTI server in the system of
drawing 1.

[Drawing 3] It is a flow chart which shows the FRAME tag extraction operation on the source
data.

[Drawing 4] It is a flow chart which shows the text data extraction operation.
[Description of Notations] 

1 WWW server 

2 Internet line network 

3 CTI server 

4 Dial-up line network 

5 Telephone 


